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Abstract: In this paper we show how the theory of dynamical systems can be employed 
to solve problems in motion vision. In particular we develop algorithms for recovery of dense 
depth maps and motion parameters using state space observers or filters. We begin with a 
review of previous, related work followed by an overview of relevant aspects of the theory 
of dynamical systems. Three dynamical models of the imaging situation are presented: 

• In the first model we assume that motion is known and a reflectance model for the 
surface is given. Depth is recovered directly from brightness measurements with a 
nonlinear Kalman filter. 

• The second model assumes that a planar surface is being viewed. Motion and depth 
are recovered with a nonlinear observer. 

• The third model makes no explicit assumptions about motion or surface. It is embed- 
ded in a system that iteratively improves an estimate for both the motion parameters 
and a dense depth map using Kalman filters. 

No feature- matching preprocessor is required. Solutions to other motion vision problems 
(tracking, camera parameter estimation) using dynamical systems theory are suggested. 
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1 Introduction 

Research in motion vision has previously focussed on the interpretation of two successive 
image frames in which either the camera or parts of the environment are in motion. Conse- 
quently, the results of these methods are instantaneous and make no use of the redundancy 
inherent in a series of frames. Examples of such instantaneous analyses include: estima- 
tion of the optical flow (Hildreth [19], [20], Horn and Schunck [27], Nagel and Enkelmann 
[40], [39], [41], Anandan [2], Fennema and Thompson [12]), recovery of motion parameters 
(Longuet-Higgins and Prazdny [35], Negahdaripour, Weldon and Horn [42], [28], Weng, 
Huang and Ahuja [64], Fennema and Thompson [12], Tsai and Huang [55], [56], Waxman 
and Wohn [63], Roach and Aggarwal [47], Prazdny [45], Waxman and Ullman [62]) and 
depth measurement (Kanatani [31], Prazdny [45], Waxman, Sinha and Ullman [62], [61], 
and Mitiche [38]). 

Recently, some proposals have been made for making use of more than two frames out 
of an image sequence. Additional frames contain redundant information that may provide 
additional constraints on an otherwise ill-posed problem. 

Several categories of multiframe motion algorithms have evolved: A first distinction can 
be made with respect to the format of the input data. 

• Feature-based methods. These methods assume that the localization and correspon- 
dence of features has been established beforehand by some other module. Since 
features are usually sparse depth cannot be recovered densely. 

• Non-feature-based methods. In most of these algorithms a brightness constancy as- 
sumption similar to the one first derived by Horn and Schunck [27] is made to deter- 
mine how the projections of real world points move in the image plane. 

The second distinction deals with the input data organization. The approaches taken 
here are 

• Recursive algorithms. These methods commonly maintain some current estimate, 
accept one frame of input data at a time and continuously update the estimate. 

• Global (batch) algorithms. Here, all of the input data is accepted at once and some 
global criterion is optimized as opposed to the local improvement achieved by the 
recursive schemes. The main disadvantage, however, is the necessity of storing large 
amounts of data and not being suitable for "on-line" applications. 

One example of a structure-from-motion algorithm that exploits information in a se- 
quence of frames is Ullman's incremental rigidity scheme [58]. In this scheme a set of 
feature points is tracked through a sequence of images. Beginning with an initially flat 
model the algorithm incorporates the incoming information about the displacement of the 
features and updates the model while maximizing its rigidity. The incremental rigidity 
scheme is a recursive feature-based method and therefore suitable for on-line recovery of 
depth information in sparse locations. 

Another feature-based algorithm is the one presented by Sethi and Jain [49]. They 
assume that a number of features has been localized in each frame of an image sequence. 



They then show how the smoothness of motion can be employed to find correspondences 
among these features i.e. they propose a solution to the correspondence problem over a set of 
motion frames. Using a greedy optimization algorithm, they address the following problem: 
given m points in each of n frames, determine m trajectories. The solution is found by 
minimizing a functional capturing a general smoothness constraint for the trajectory path. 

Tsai [54] presents a surface reconstruction procedure for known viewing orientations 
extended to n viewing positions. He shows that the multiframe method provides higher 
accuracy while only slightly increasing the computational complexity. This will be the deci- 
sive efficiency measure for multiframe algorithms: can the accuracy be increased sufficiently 
with respect to the additional computational cost. 

Tracking is another application of multiframe motion vision. Usually the idea is to 
control the motion of a camera viewing a certain moving object such that the object image 
remains nearly fixed in the image plane. This application is mostly feature-based and an 
example of a purely recursive scheme. There are numerous publications on this subject. 
Extensive reference lists are provided by [10] and [3]. 

Shariat and Price [50] use point correspondences in more than two frames to estimate the 
motion of a moving object. They exploit the fact that after compensating for translation, 
points on the observed object will move in circles in space according to the rotational 
velocity. The motion estimation problem can then be reduced to the problem of computing 
circle parameters from point correspondences. This approach can be classified as a feature- 
based global solution since measurements from all frames must be available before the 
parameters can be computed. 

Another recursive depth estimation procedure for more than two frames was suggested 
by Bharwani, Riseman and Hanson [5]. A correlation based method similar to the one 
employed in section 6 is used to estimate displacements of features between successive 
frames. Under the assumption of known translational motion the depth is then easily 
recovered. This recursive approach is not really restricted to identifiable features and pure 
translation but it becomes much easier to handle mathematically. An important feature of 
the algorithm is that it exploits the fact that motion is known to predict the displacement in 
the next frame and thereby limit the search in the correlation- based estimator. However, a 
more physically motivated strategy for predicting and updating the estimates that involves 
the minimization of measurement errors seems necessary. In addition, we would like to 
avoid restricting the motion to known translation. 

The above methods employ different approaches for dealing with the data sampled over 
time, which are largely motivated by physically intuitive constraints imposed upon the 
problem (for instance rigidity of the object, smoothness of the trajectories). Researchers 
have been looking for a more systematic way in which to incorporate the temporal relation- 
ship among the successive measurements. Recently problems in motion vision have been 
formulated in a way that deals explicitly with the time-dependency of the observed data. 
Such a formulation is provided by the theory of dynamical systems. 

One of the earliest applications of dynamical systems methodology may be found in 
Stuller and Krishnamurthy [52]. As we will see, a Kalman filter is a means of estimating 
states of a dynamical system when only a function of these states is measurable as output. 
Stuller and Krishnamurthy formulate the projected translational motion of an object in 



terms of dynamical systems equations in which the interframe displacements appear as 
states so that a Kalman filter can be employed to estimate them. The authors present 
experimental applications of their method. Apart from the restriction to translational 
motion the approach has a severe flaw: the underlying dynamical equations constitute a 
merely heuristic association of the variables needed for the estimation. When mathematical 
models are not based on laws of physics their relationship to reality becomes unclear. 

Besides investigating the relationship between the motion-field and the optical flow, 
Verri and Poggio [60] noted that the motion field, the projection of the 3D velocity field 
into the image plane, is a vector field tangential to the trajectories of a dynamical system 
and that the qualitative properties of this dynamical field are reflected in the properties of 
the optical flow field that is commonly recovered from brightness variations. They give an 
overview of the theory of dynamical systems from the theoretical perspective of [21]. 

Verri, Girosi and Torre [59] establish a formal correspondence between the theory of 
dynamical systems and motion vision. Their basic observation is that the motion field 
equations form a two-dimensional dynamical system. Properties of this dynamical system 
can be inferred from its behavior near singular points that may appear in images as a 
focus of expansion, for instance. They show how the type of motion as well as quantitative 
measurements of rotation and translation may be determined given the optical flow near 
the singular points. 

Direct applications of the theory of dynamical systems to the estimation of object 
kinematics and structure in motion vision have been discussed repeatedly by Broida and 
Chellappa. In [6] the authors present a dynamical system that describes the temporal 
variation of 3D coordinates and velocity. They further present a model of the measurement 
process corrupted by noise. Given such a formulation, the technique of optimal filtering is 
applied to estimate depth and velocity. The scheme suffers from an extremely simplified 
description of object motion (only two real- world points with known position relative to the 
center of mass are described by the model), the motion is planar (translation and rotation 
in a plane only) and the measurement equations assume known interframe correspondences 
(no relationship to the measurable brightness values is established). 

In [7] the model is generalized to n feature points but a known matching between the 
frames is still assumed. Again the iterated extended Kalman filter is employed to perform 
the recursive estimation. Unit quaternions are used to represent interframe rotation. No 
application of the technique to real data is mentioned, presumably because the preconditions 
are yet too restrictive. The authors have also proposed a corresponding global solution [8] 
which again shows the duality of the two approaches. 

Interesting practical results are presented by Dickmans [11]. The objective in this paper 
is to recover real-world coordinates and motion parameters from the measured image point 
coordinates of distinguished feature points. This requires assumptions about the underlying 
dynamics of the object being viewed to obtain a complete dynamical systems description, 
which is then used in a Kalman filter estimation procedure. Although the paper does not 
present the exact form of the models employed, extensive experiments involving docking 
procedures for a vehicle, landing of aircraft and guidance of motor vehicles are reported. 

Recently Matthies, Szeliski and Kanade [37] have presented ideas very similar to our 
own. They propose a non-feature-based recursive dense depth estimation based on Kalman 



filtering. Motion is assumed to be known, purely translational and parallel to the image 
plane. A correlation estimator is used to compute displacements of image points between 
successive frames. If motion is known, depth can be recovered in a straightforward manner. 
The key idea is to use the Kalman filter to update a dense depth map given these correlation- 
based measurements for every new frame. Although severely restricted by the necessity to 
know motion, encouraging experimental results are presented. 

The main intention of this paper is to introduce a rigorous dynamical systems framework 
for the motion vision process. This begins by realizing that there is no one dynamical system 
that describes the behavior of successive frame measurements "correctly". The structure 
of the system depends crucially on the assumptions made about the imaging situation, 
which are mainly reflected in the output equations. We describe three different dynamical 
models of motion vision appropriate for various imaging situations. As an example of how 
techniques used in the control of dynamical systems may be employed in motion vision we 
develop recursive algorithms which do not require the existence of feature-correspondence 
for the estimation of dense depth maps and motion parameters using Kalman filters. We 
stress the importance of a rigorous physical motivation of the model of the imaging process, 
which is a major flaw of many previous approaches. 

We proceed as follows. The subsequent section introduces the terminology and notation 
of dynamical systems. We then describe our choice of coordinate systems and derive the 
basic motion equations for both the continuous and the discrete case. Then three dynamical 
models of the imaging situation are presented and corresponding estimators for depth and 
motion are derived. 

2 Dynamical Systems 

This section contains an overview of the relevant material from the theory of dynamical 
systems. The reader familiar with the topic may wish to skip this section. Literature in 
this area is readily available. [36], [14] and [30] provide introduction to the general theory 
dealing mainly with continuous, linear systems. Discrete systems - of particular importance 
in our case of a sampling camera system - are discussed in [32], [44], [65] and [13]. Special 
topics such as identification may be found in [43] and [34]; for filtering a widely used 
reference (cf. [52], [6], [7]) is the one by Gelb [16]. 

2.1 Basic Concepts 

We present here the state space model of dynamical systems. In the most general case 
such a system is described by a first-order vector differential equation 

x(*) = f(x(*),u(*)) (1) 

and an algebraic vector equation 

y(0 = g(x(t),u(0). (2) 

Equation (1) is referred to as the plant equation that describes the dynamic behavior of the 
system. It states that the temporal variation of the state vector x(t) depends (possibly in a 



nonlinear fashion) on x(t) itself and an input vector u(t). The output equation (2) models 
the measurement process in which the output vector y(t) is obtained as a function of the 
current state and the input. 

Linear systems are a subset of the above, the model being 



k(t) = Ax(i) + Bu(i) 
y(«) = Cx(<) + Du(t). 



(3) 
(4) 



where A, B, C and D are matrices. Most available methods operate on linear systems as 
we will see. We often visualize dynamical systems using block diagrams as in figure (1). 
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Figure 1: Block diagram representation of dynamical systems 



The major task in the application of dynamical systems theory is constructing an ap- 
propriate model. We thereby encounter a tradeoff between the model's complexity and the 
accuracy with which it represents the actual dynamics. 



2.2 An Example 

Consider a mass m attached to a spring of stiffness k. The attachment of the spring 
may be moved by external influence by an amount u(t). Finally we let x(t) denote distance 
between the mass and the attachment. The arrangement is depicted in figure (2). 
Summing up the forces on mass m provides the equation 



mx — mg — k(x — u). 



(5) 



By introducing x = x - mg/k, which simply describes the distance from the equilibrium 
x = mg/k, we obtain 

x = x H u. (6) 
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Figure 2: A simple dynamical system 



Now we use the dummy variable v — x (simply the velocity) to convert the second-order 
ODE into two first order ODE's which are the desired plant equations: 
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(7) 



We see that the plant equations are linear corresponding to (3). We can measure the 
behavior of this system by the position of the mass, which is simply the first component of 
the state vector. So the output equation is 



2/ = * = [1,0]: 



(8) 



2.3 Discrete Systems 

For many systems, output and state are only defined at discrete points in time. We may 
also think of such systems as the result of a continuous system being sampled periodically. 
Assuming system (3), (4) being subject to sample and hold of period T, we can write 



Xfc+i = $Xjfc + Tu fc 

y k = Cx k + Dun 

where x^ is a shorthand for x(Tfc), k = 0, 1, . . . and 
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(9) 
(10) 



(11) 
(12) 



This description is suitable for dynamical systems involving digital computers, especially 
for processing of image frames taken at discrete points of time. 

2.4 Linearization 

Unfortunately, many realistic problems do not lead to such simple linear differential 
equations but appear in the more general form of (1), (2). Very few methods for the tasks 
discussed in the following sections are available, which operate directly on the nonlinear 
description. A remedy is to linearize the system by using a Taylor series expansion about 
some fixed point xq, uq: 



f(x, u) = f(x , u ) + ^ 



n 9f 

(X- Xo)+ -K- 

xo ' dw 



(u-u ) + --- (13) 

U 



If we assume that state and input vary only slightly around the point about which f has 
been expanded, we may neglect higher-order terms and write 

Ax=AAx+BAu (14) 

where we introduced Ax = x — xo, Au= u — uo and 

(15) 



tfx 



and B = 7— 

xo du 



Uo 



are Jacobians of f . A similar argument leads to a linearized output equation and we may 
apply linear techniques to the transformed system. 

2.5 Stability 

We differentiate two definitions of stability (see [15], [30], [36]): 
External (BIBO) stability: A dynamical system as in (1), (2) is said to be externally stable 
if a bounded input function u(t) results in a bounded output function y(t). 
Internal (asymptotic) stability: A linear dynamical system as in (3), (4) is said to be 
internally stable if the solution of 

k(t) = Ax(() (16) 

tends toward zero as t — ► 00 for arbitrary initial condition x(tfo)- 

A necessary and sufficient condition for internal stability (also called asymptotic stability) 
is that all eigenvalues of A have negative real parts. An internally stable system is also 
externally stable. Lyapunov and Poincare proved that a nonlinear system is stable with 
respect to small perturbations from an equilibrium Xo if the system matrix of the system 
linearized about xo is stable. 

2.6 Control 

In the broadest sense, control is the augmentation of a given dynamical system by ad- 
ditional systems in order to achieve a desired overall behavior. The most common control 
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Figure 3: Feedback control 



structure is feedback where the state x is transformed by a compensator matrix R, com- 
pared with a commanded input vector u c and the difference used as input u to the plant 
as in figure (3). 

We then have that 

u = u c — Rx (17) 

which gives us the overall plant equation of 



x = ( A - BR)x + Ru c 



(18) 



To influence the overall behavior of this system we may choose the matrix R. The properties 
of a dynamic system are determined by the position of the eigenvalues of the system matrix 
(cf. stability). A common criterion for the choice of the elements of R is therefore to achieve 
desired locations of the poles of A — BR. This is called pole placement. 



2.7 Observers and Filters 

We are often interested in knowing the state x(t) of our dynamical system but can only 
measure the output y(t), for instance in the case of feedback control. An observer is a 
dynamical system that reconstructs the state x(t) given the output y(t) and the input u(t) 
when the dynamics of plant and output are known (figure (4)). 
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Figure 4: The observer principle 

In the linear case (3), (4) we use the following observer structure 

x = Ax + Bu + E(y - Cx - Du) 



(19) 



where x denotes our estimate for the state vector x and E is a matrix that we may choose 
freely. We see that this is a copy of the original plant equation augmented by a term for 
the error between estimated and actual measurement value. A criterion for the choice of E 
may be derived from the error e = x — x. From (19) and substituting (4) for y we have 



e = (A - EC)e 



(20) 



which means that the dynamics of the error depend on the eigenvalues of A — EC. We 
may therefore influence the dynamics of this error using pole placement techniques on this 
matrix. We will at least require this matrix to be asymptotically stable in order for the error 
to disappear eventually. For discrete systems the error can vanish after a finite number of 
sampling periods (dead beat response). This can be achieved by placing all poles of the 
discrete observer at the origin. 

A Kalman filter is derived from an observer of a linear system as in (3), (4) where 
state and output are corrupted by Gaussian noise of a known distribution. State, input 
and output are interpreted as random processes with a known distribution of the initial 
state vector x(0). The filter then consists of two stages for every iteration: a prediction 
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phase in which the next state vector is computed by using the current state estimate in 
the state equations and an update phase in which the current estimate is altered according 
to the error between expected and actual measurement just as we have seen for observers. 
The key difference between observers and Kalman filters is that the gain matrix E depends 
on the variance of the measurements, or in other words, a discrepancy between predicted 
and actual measurement influences the estimate in proportion to the confidence we have 
in that measurement. This provides the filter with the ability to estimate optimally the 
state vector in the sense of minimal noise corruption, but places the burden of determining 
the variances on the designer. This methodology may be extended to nonlinear systems by 
linearizing the state and output equations around the current estimate in every iteration. 
This is known as the Extended Kalman Filter (refer to [16]). We have summarized the 
equations of the Extended Kalman Filter below. Observe that they reduce to a simpler 
form in the special case of linear system. 



The Extended Kalman Filter 



(21) 



System Model 
Output Model 


x.k+1 = f(x fc , u fc ) + Wfc w fc ~ JV(0, Qj.) 
y fc = g(xfc) + v fc v* ~ N(0, R fc ) 


Initial State 
Non-correlation 


x ~iV(xo,Po) 
E[w k vl] = for all k 


State Estimate Propagation 
Error Covariance Propagation 


P*+i = **P**fc+Qfc 


State Estimate Update 
Error Covariance Update 
Filter Gain 


k t = *k + E *[y*-g(* fc )] 

P+ = (I-E fc C#)P* 



x fc 



We have used the abbreviations $ k = |i . for the system Jacobian, Cfc = $& 

for the output Jacobian and x ~ N(/j,,<j) for a normally distributed random variable x 
with expected value fi and variance a. The Kalman filter forms the core of the estimation 
algorithms for motion vision presented below. 



3 Dynamical Systems for Motion Vision 

Before formally modelling the imaging situation as a dynamical system we provide some 
insight into why such an interpretation may apply. Recall that a dynamical system consists 
of a state that changes in time according to a plant equation (1) and a measurable output 
which is a function of that state at every instant (2). In the imaging situation in motion 
vision the position and attitude of an object being viewed changes in time according to the 
kinematic equations. Only a function of position and attitude, namely the projection into 
the image plane, can be measured at every instant. We could identify position and attitude 
as the states of our system, the kinematic equations as the plant and the 2D projection as 
our measurement. 
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However, the interpretation is not quite that simple. We are not really measuring the 
position and orientation of the 2D projection. We merely have an array of brightness values 
available in every frame. One approach would then be to determine first the location and 
orientation of the projection from the grey-level array and use the result as a new "meta"- 
measurement. This leads to the feature-based approaches mentioned in the introduction. 
An alternative is to make assumptions about the formation of the brightness value and 
in particular how a change in brightness relates to a change in position and orientation. 
The most common assumption is the brightness change constraint equation of Horn and 
Schunck [27]. The idea there is that the brightness of a patch corresponding to a fixed 
location on the surface being viewed will not change significantly from one image to the 
next. 

A conclusion we can draw from these thoughts is that the major emphasis in finding 
suitable dynamical models for the motion imaging situation will be on constructing appro- 
priate measurement models. Since there is no unique model for a given system a major task 
is to investigate different models and identify those that represent the temporal behavior 
most accurately while remaining at a low level of complexity (nonlinearities, order of the 
system, etc.). 

We have constructed three models of the imaging situation that differ mainly in the 
type of measurement assumed. All models include the depth (distance to the object) as 
a state that is not measurable in the output. The main idea is to employ a state space 
observer or a Kalman filter as described in section 2.7 to estimate the depth. One would 
also like to recover the motion of object or observer. Note that an observer/filter solution 
would then require the presence of some plant equation describing the temporal evolution 
of the motion parameters. In the case of object motion this is rarely ever known and we 
can only hypothesize a model as done in section 5. When the camera is in motion we may 
use the dynamics of the actuating device to obtain such a relationship as we suggest in 
section 7. We believe that the best procedure is to estimate depth with a Kalman filter 
and use the resulting depth map to compute motion in a least-squares fashion. This is the 
approach taken in section 6. 

The first model we present is in a sense the physically "correct" model as it attempts 
to establish the relationship between the measured brightness value variation and the ob- 
ject/observer motion. This requires an assumption about the reflectance properties of the 
surface (lambertian, specular etc.) and is therefore rather restrictive. We also assume that 
motion is known. A dense depth map is recovered using an Extended Kalman Filter. 

The second model employs the brightness change constraint assumption to avoid mod- 
eling the surface reflectance and illumination conditions explicitly. Since the brightness 
change constraint equation is an implicit equation, this requires a modification of the gen- 
eral observer scheme. The other important characteristic of this approach is that the surface 
being viewed is assumed to be planar, which allows us to parametrize it and recover the 
surface parameters rather than a dense depth map. We also show how the use of a simple 
motion model permits us to recover motion within the observer. 

Section 6 finally presents a system that estimates both motion and depth recursively. 
A dense depth map is obtained either directly from brightness values or from a correlation- 
based displacement measurement using the current estimate for the motion parameters. 
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This constitutes the update section of a non-linear Kalman filter. The new depth map 
is then used to find a new least-squares estimate for the current motion and a predicted 
depth map for the subsequent frame. This is formulated as the prediction stage of a Kalman 
filter. This scheme requires no assumptions about surface reflectance, shape, illumination, 
or motion (besides the ones inherent to the brightness constancy assumption, see Verri and 
Poggio [60]) and is therefore best suited for implementation and processing of real imagery. 
The remainder of this section is devoted to the introduction of basic terminology for 
the imaging situation, choice of coordinate system, etc., which is similar to that of Horn 
[25] and Negahdaripour [42]. 

3.1 Coordinate Systems 

We will use a coordinate system with origin coincident with the center of projection of 
the camera, the z-axis pointing towards the image plane. The image plane is parallel to the 
x-y-plane at unit distance from the origin i.e. we express image-plane coordinates in units 
of focal length. The situation is shown in figure (5). 



Object 




Center of Projection 



Image Plane (z = 1) 



Figure 5: The viewer centered coordinate system 



A point on an object in the real world is represented by the vector 

R = [X,Y,Z] T 



(22) 
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and the projection of that point into the image plane by 

r = [x,y,lf. (23) 

The perspective projection equation contains the relationship between both points: 

r = JL = * (24) 

Rz Z y ' 

where z denotes a unit vector along the z-axis. 

3.2 Continuous Motion 

We assume that object and camera are in relative motion given by a translational 
vector t(i) and a rotational vector u(t) in the coordinate system of the camera. In [18] we 
showed that the motion perceived in the image plane is the relative motion of object and 
camera if both object and camera motion parameters are defined in a coordinate system of 
the camera. We also showed that the generalization from instantaneous velocities t, u to 
velocity functions of time t(f), u>(t) is valid. 

Consider a point on the observed object with position vector R(i). The motion of the 
point on the object is given by 

R(t) = -t(t) - u(t) X R(i) (25) 

and the corresponding motion field in the image plane is 

r=-zx(rx(rxw-^r)). (26) 

R ■ z 

This is also referred to as the motion field equation. 

3.3 Discrete Motion 

We will discover that although the above continuous formulation accounts precisely for 
the changes due to motion, a discrete description is better suited for the purpose of con- 
structing a computer algorithm to implement dynamical systems techniques. This means 
that we must use finite rotations and translations instead of the infinitesimal quantities t 
and u. There are numerous ways of representing finite interframe transformations. We will 

o 

use a displacement vector p for the translation and a unit quaternion Q for the rotation. 
Unit quaternions are an efficient means of representing rotations, used in vision by [7], [22], 
[24]. Information on the subject is available in [51], [53], [22] and [48] where the latter 
provides an excellent comparison of various representations of rotation. 

For our purposes it will be sufficient to understand a unit quaternion, which we will 
denote by a circle above a character in what follows, as a four-dimensional unit vector or a 
generalized complex number 





q = 



Qo 

L ?* 



= qo + iq x + jq y + kqz- (27) 
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Rotations about an axis along the unit vector u> = [u> x , u> y , u> z ] T by an angle 9 are represented 
by the unit quaternion 

q = cos - + sin -(iu) x + ju> y + ku> z ). (28) 

o 

Vectors are purely imaginary, scalars purely real quaternions. So if q denotes a rotation as 
above and x is the quaternion corresponding to some vector x then the rotated quaternion 
is 



o' O o o* 

x = qxq = 



q-q 

o {ql + ql - q 2 y - ql) z(qxq y -qoqz) i(q x q z + q q y ) 

2(q x q y -q Q q z ) (ql - ql + q 2 y - q 2 z ) 2(q y q z - q q x ) 

2(q s q z - q q v ) 2(q y q z - q q s ) {ql - q% - q 2 y + q 2 z ) 



(29) 



where q is the conjugate quaternion to q. 

How does a rotation and translation in this representation change a real-world posi- 

o _ o 

tion vector R^ amd its quaternion equivalent R^ at discrete time kl We rotate Rk by a 

o o 

quaternion kk using (29) and translate by a finite displacement quaternion Vk to obtain 

o 

Rk+i- 

0* 

Rk+i = Pk + qkRkq k (30) 



The discrete motion field (the displacement field r^+i — r^) is determined by 

O O /° o o* 

o _ Rk+i _ Pk + (gfc^fegfc)^fc (u) 

Zk ^ {{Pk + {%r k q k )Zk)-z) 

where z is a unit quaternion with z-component 1. The denominator of the last term is 
simply the last component of equation (30). 

4 A Discrete Model for Brightness Measurements 

The first decision when modeling a dynamical system is which quantities should be 
included in the state vector and which quantities are measurable as output. We have 
already indicated the importance of the measurement model, which we derive first in this 
section. We designate the brightness values E(x,y) to be the measurements in this model 
where x and y are image plane coordinates. The measurement model must then relate 
E(x, y) to the time- varying quantities in the system such as the location and orientation of 
the object being viewed. Depending on which relationship we obtain, we must then compile 
all time- varying quantities therein into the state vector and formulate a plant equation that 
governs their temporal evolution. 

Another major decision is whether to formulate the model discretely or continuously. 
Since the measurements are taken discretely and the Kalman filter is inherently a discrete 
algorithm, we decided to formulate the model discretely. 
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4.1 An Output Equation for Brightness Values 

We have designated the brightness values in the image to be our output or measure- 
ment. To derive a corresponding output equation for use in our dynamical system we must 
determine the relationship between the brightness E at some point (x, y) in the image plane 
and the time- varying parameters of the system such as position and attitude, which we later 
will include in the state equations. We need some basic physical facts about the imaging 
situation that we have taken from [25], chapter 10. 
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Figure 6: The formation of brightness values 



Consider a point P on an object whose position vector forms an angle a with the optical 
axis and whose radiance is L (figure 6). The image brightness at the projection of P into 
the image plane is 

£ = Z^) 2 cos 4 a (32) 

where d is the diameter of the camera lens and / is the focal length which we scale to 1 in 
what follows. The cosine can be expressed in terms of the real- world or image coordinates 

(33) 



cos a = 



so we have 



y/X 2 + Y 2 + Z 2 y/1 + x 2 + y 2 
Lwd 2 Z 4 



E 



4(X 2 + Y 2 + Z 2 ) 2 ' 



(34) 
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Unfortunately, L depends on the type of surface being viewed and the illumination. On 
the other hand it is intuitive that we will not be able to model image brightness values 
correctly without some assumptions about illumination and surface reflectance. The two 
most widely used surface models are Lambertian and specular. For the Lambertian surface 
we have (see for instance [25]) 

L = p(i- n) (35) 

where p is the surface albedo, i is a unit vector in the direction of the light source and n is 
a unit vector along the surface normal of the point being viewed. 

Several models have been proposed for specular surfaces. The one developed by Tuong- 
Phong [57] and applied to computer vision by Horn [26], [23] models the radiance of a 
surface illuminated by an extended source as 

L = A(^p)" (36) 

where A is a specular reflectance coefficient, n a parameter that describes the compactness 
of the specular patch, and s is a unit vector in the direction of perfect specular reflectance 
which can be written as 

s = 2(i • n)n - i. (37) 

In practice a linear combination 

J- 1 = ''■'-'lambertian T \± «) '-Li specular \"°) 

provides a good approximation in which t may vary spatially. 

Using the expressions for the radiance (35), (36) in the equation for the image brightness 
(34) yields 

and 

E = iw^ (2(i ' n)(R ' n) " (i ' R))n - (40) 

We see that in both cases E depends on the time- varying quantities R (the position vector) 
and n (the surface normal). This means we must include these variables in our state vector. 
We note that in both the specular and the Lambertian case, we can use r instead of R 
in the expressions for E. The time dependency of r is more complex than the one for R as 
we see by comparing (25) and (26). For this reason and for the sake of variety we use R 
here. We also focus on the simpler Lambertian case in what follows, although the specular 
case is analogous. 

4.2 The Model 

Our output equation (39) dictates that the state vector must contain at least R and 
n. We must therefore determine how these vectors change due to motion. Our experience 
with the previous model revealed the advantages of a discrete motion representation. We 
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assume that the camera undergoes a translation Pk and rotation Qk at time k. What is the 

O O 

state equation for Rk, i.e. how does R k +i depend on Rkl This is simply the object motion 
equation (30): 

kk+i =Pk + VkhkOk- ( 41 ) 

We must determine the analogous relationship for the change in the unit surface normal 
nf.. A unit normal remains unchanged under translation of the coordinate system - it will 



merely be rotated by 9 according to 



o o o ° 

»*+i = %n k q k . 



We can now summarize our model from (41), (42) and the output from (39) as 



Xfc+l = 



Rk+i 



O 0* 

Pk + QkRkVk 

O o o* 



= f(Xfc,Ufc) 



y k = E k = 



P*d 2 Zt 



4|R*| 



f(i • n fc ) = g(x k ) 



(42) 



(43) 



(44) 



O rp O O o 

where the input is u^ = [Pk,9k] ■ This model contains some redundancy since R, P and n 
are vectors, i.e. purely imaginary quaternions (we can omit their first components in the 

o 

model), n is a unit vector (we only need two of its components) and <Z is a unit quaternion 
(we only need three components to represent it). So a minimal model could have the state 
and input 



Xfc = 



X k 
Y k 
Z k 

n x k 
n yk J 



u* = 



Pxk 
Pyk 
Pzk 
Qxk 
iyk 
L Qzk 



(45) 



This dynamical system describes the change in brightness for the projection of a particular 

o o 

point at position vector R as a result of translation P and rotation <7. As before, we are 
interested in recovering the depth Zk and finding a filter suitable for performing this task. 

4.3 Estimating Depth using an Extended Kalman Filter 

The extended Kalman filter is an observer for a nonlinear system corrupted by noise 
with certain probabilistic assumptions about the distribution of the noise and the initial 
values. The filter equations are summarized in table (21). 

We assume that our system (43) is corrupted by noise w^ which is N(0, Q k ) distributed, 
so that 

x fc+ i = f(x fe , Ujt) + w*. (46) 
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Similarly, the measurements are influenced by noise v k which is N(0,r k ) distributed. The 
measurement equation becomes 

Vk = fif(xfc) + v k . (47) 

We finally assume that the initial value of the state Xo is normally distributed around our 
initial guess xo with known covariance Po and that the noise processes influencing state 
and output are uncorrelated: E[w k v k ] — for all k. 

We can make these assumptions in our particular case but must deal with some of the 
issues explicitly before proceeding. A good initial state estimate i.e. xo, is not available 
without additional information. We can only guess it, but we can run our filter with 
several different initial values and select those that converge. Similarly, the covariance 
Po of the initial state estimate is unknown and must be guessed. The covariance Q of 
the state, however, may be assumed to be zero since the kinematic equations are geometric 
relationships and not subject to noise. We may have some knowledge about the uncertainty 
in our velocity information which enters into the kinematic equations. This in turn would 
result in a non-zero covariance. The measurement variance r is due to noise in the camera 
sensor. A noise model and the corresponding variance is presented in section 6. 

We find that the state estimate propagates as 



and the covariance as 



p k+1 = * k pt*k+Q k - 



Between propagation we update the estimates from k k to x^ according to 



** = x fc + e k [y k - g(Z k )] 



and the covariance according to 



P+ = (I-e fc cf)P; 



where the Kalman filter gain is computed as 



e k = P k c k [c k v P k c k + r k ] 



-l 



(48) 
(49) 

(50) 
(51) 
(52) 



at 

5x 



and c k = 
x* 



x* 



These equations contain the Jacobians & k 

In order to apply this algorithm we need merely compute the matrices $ and c. In the 
case of the minimal system with state and input from (45) we have 



2(q x q y - qoq z ) 
%*92 - qoq y ) 

o 

o 



2(?*? v - qoqz) 

(?o -ql + q 2 y - ql) 

2(q y qz - qoq x ) 

o 

o 



2(g*?* + qoq y ) 
z(q y qz - qoqx) 

{ql -ql-q 2 y + ql) 
o 
o 



(53) 
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{<& + ql-il- il) 

2(9*?v - qoqz) (?o 







2(9*92/ - 9oQz) 



ql + vl 



ll) 



(54) 



and 



c = 



pircPZ 4 



111 \r 

w 

i-n v 

in Jf 2 +K 2 
jKp. z 

**. 
4 



(55) 



where we have omitted the hat as an indication of estimation and the time index k. 

Let us finally consider how the measurement and filter process will be implemented. 
The plant equation (43) describes how a point P and the surface normal at that point 
change relative to the camera due to motion. The Kalman filter algorithm will therefore 
recover the depth Z of that particular point. Consequently, the brightness measurement 
used in the filter must be the brightness at the location of the projection of P into the 
image plane. Since this projected point moves throughout the measurement process, we 
would have to try to estimate the position of the point in every frame and then take the 
measurement there, which seems rather error-prone. To obtain a dense depth map, we 
must repeat this for all points in the region of interest. 

Instead, we will store the current depth estimate with every grid point of the image 
array and use the brightness value at that location to update the state estimate in the 
Kalman filter. The new depth estimate will then be associated with the location (x,y) = 
(X I Z , Y / Z) which is the position we expect the projection of P to move to according to 
our estimate. This position may not lie at a grid point so we must interpolate to obtain 
the values at the grid points. The details of this interpolation are discussed in section 6. 
We can summarize the filter algorithm (there is one filter for every point observed) as 
follows: 

(1) Set initial values: 

For every point 1 to m under consideration set xj = unless better initial guess available. 

(2) Filter loop: 

For frames k — 1 to n do 

For all points 1 to m do 

(2.1) Measure the brightness values E^. 

(2.2) Compute the system matrices <&* and c\. from equations (54) and (55) for the 
current state estimate x£ and the known input u^. 

(2.3) Compute the state estimate x£" +1 and the covariance P^ +1 for time k+ 1 from 
equations (48), (49). 
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(2.4) Compute the filter gain efc+i from equation (52). 

(2.5) Update the state estimate to x£ +1 and the covariance to P£ +1 using equations 
(50), (51). 

(2.6) Associate the new state with image location (^k+i/ ^k+i^h+i/ ^t+i) anc ' '"" 
terpolate the state estimates at integral grid point locations. 
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Figure 7: A block diagram of the Kalman filter 



A block-diagram of this algorithm is shown in figure (7). There is one such filter loop 
for every point in the region of interest. 
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4.4 Discussion 

We can easily see that the above algorithm runs in Q(nm) time where n is the number 
of frames and m is the number of points. This is clearly optimal for surface estimation that 
makes no assumption about the surface structure (i.e. we must look at every point in every 
frame). A parallel implementation of the algorithm could assign one processor to every 
image plane pixel, which can compute the observer matrices for those points in parallel. 
Every processor will also evaluate the new state estimate, which determines the location 
of the processor to operate on the pixel in the next frame. The current estimate must be 
transmitted to that processor. Since interframe changes will be small for high frame rates 
it is sufficient that locally neighboring processors be interconnected in order to execute this 
transmission efficiently. The parallel complexity is then merely 0(n). 

So besides being fast, the algorithm makes no assumptions about surface structure and 
can produce a dense depth map. The interesting properties (as well as the deficiencies) of 
the algorithm given above are summarized here: 

(1) We use a discrete model and avoid discretization errors. 

(2) The use of brightness values as output permits us to operate directly on the measured 
data but also requires knowledge of surface and illumination parameters. 

(3) We assume that motion is known. 

(4) Noise is modelled. 

(5) Depth is recovered using a nonlinear Kalman filter that is optimal in the presence of 
noise. 

(6) Efficiency and perhaps convergence rely on a good initial guess for the state vector. 

From this perspective we can also see why the filter algorithms given by Broida and 
Chellappa [6], [7] require feature correspondences between frames to be precomputed. In- 
deed if we had those correspondences here, the performance of the algorithm would increase 
significantly because we no longer have to estimate the position to which a given point will 
move in the next frame and interpolate the state estimates at grid points. Conversely the 
estimator can be used to help establish feature correspondences because we estimate where 
the feature has moved to in the subsequent frame. This is an additional interesting applica- 
tion. Only tests with an implementation will eventually reveal performance and accuracy. 
Problems (1) and (2) are inherent to the model and we attempt to alleviate them in what 
follows. 

5 A Model for a Planar Surface using the Brightness Change Constraint Equa- 
tion 

We have seen that the attempt to construct a precise physical model of the imaging 
situation forces us to make assumptions that cannot always be guaranteed for real images 
(for instance that we have lambertian surfaces). We will use the approximation of the 
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brightness change constraint equation in this section to avoid having to model the formation 
of brightness values in the image explicitly via surface reflectance properties. Another 
improvement the previous model calls for is to drop the requirement of known motion. 
This is addressed by assuming a very simple dynamical model of the motion which allows 
us to incorporate motion into the plant equation. Finally, we assume that a planar surface 
is being viewed which changes the nature of the problem from having to estimate a dense 
depth map to estimating a small number of parameters for the plane. 

5.1 The Discrete BCCE: An Implicit Output Equation 

Horn and Schunck popularized the use of the brightness change constraint equation 
(BCCE) 

f = (56) 

in their work on optical flow [27]. Negahdaripour, Weldon and Horn [29], [42] employed 
the BCCE to recover motion and surface parameters instantaneously. Here we investigate 
the use of the BCCE as an output equation of a dynamical system. The BCCE can be 
expressed in terms of image brightness derivatives 

E r • r + E t = (57) 

where E r = [§f , §f ,0] r and E t = |f. We wish to obtain a discrete form of the BCCE 
analogous to the previous equation. The Taylor series expansion of E provides 

AE = E r • Ar + E t At = (58) 

and assuming that we have measurements at discrete times k = 0,1,... we obtain Ar = 
rfc+i — r k and At = k + 1 — k = 1 so 

E rfe • (rjfe+i - r k ) + Etk = 0. (59) 

The discrete motion field r^. +1 is given in equation (31) in section 3.3. We can use this 
equation by reexpressing the above BCCE in terms of unit quaternions. This presents no 
difficulty since all vectors are simply replaced by corresponding purely imaginary quater- 
nions. Substituting (31) into (59) yields 

&>•( > + . (l ffi*'. -?») + £»-<» (60) 

((P k + (l k r k <l k )Z k ) ■ z) 

and after some rearrangement we have 


o o* Vu _ , 

(Q k r k q k + -±)-s k = (61) 

O O Q Q 

in which we have abbreviated s k = E rk + (E tk - E rk • r k) z = [0,E xk ,E yk ,E tk - x k E xk - 
y k E yk ] . This is the discrete brightness change constraint equation. 
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In this model we restrict ourselves to planar surfaces. The equation of such a plane is 

'— RTi = l (62) 

where n is a normal to the surface and l/|n| is the distance of the plane to the origin. 
Substituting the quaternion equivalent into the discrete BCCE yields 

(9kh9* k + (n fc • r k )Pk) • h = 0. (63) 

This relationship depends on a particular point in the image plane. Since we wish 
to use the BCCE to make statements about the plane and its motion that are obviously 
independent of particular points in the image plane, the question arises at which point in 
the image plane (63) should be evaluated. Since no point is distinguished we may choose 
one at random. A better approach is to sum up the values over a region of the plane. 
Of course then we must square them before adding to avoid cancellation. This could be 
written as 

w h m 

2 Z) [&**«* + ("* • h)h) ■ k? = (64) 

x=— w y=— h 

where w and h are height and width of the image region under consideration. 

We see that this equation implicitly relates the measurable quantities s k to the motion 

O O o 

and surface parameters P, Q and n at every image plane point r. So if we interpret the 
former as an output 

y = s (65) 

and the latter as a state 

n 



x = 



P 

q 



(66) 



we have a relationship of the form 

<?(x,y) = 0. (67) 

We have used vector symbols here to denote the imaginary parts of the above quaternions. 
Since s, p and n are purely imaginary and 1 is a unit quaternion it suffices to include 
the three imaginary components in the state - the remaining component is either zero or 
related by some algebraic equation. Unlike our usual output equation (2), equation (67) 
only implicitly relates state and output so that standard observation techniques do not 
apply. Before we consider modifications of our observer to deal with this problem we have 
yet to establish the state equations. 

5.2 The State Equations for the Moving Plane 

o o o 

How do the three components n, p and <7 of our state vector change in time? Let us 
first determine how the surface normal will change due to the motion. This is not trivial 
since we use the inverse length of n to express the distance from the origin. 
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For pure rotation, the distance between plane and origin does not change and the normal 

o 

is simply rotated by Qk- 

O O o O* /•„„■» 

n k +i = Ik^k^k- ( 68 ) 

For pure translation, the orientation of the normal remains unchanged but the distance 

o 

changes by the amount of the translation displacement Pk projected onto the unit normal 
along n so 



1 1 o 

— 1 ft , 


o 

n k 


o --.I** 

K+il Kl 


\ n k\ 


K+il= o , 





which can be rearranged to 



l + Pk-n k 
Since a unit normal along n would remain unchanged by translation, we have 



(69) 



(70) 



o n k o . 

»*+i = 73— K+il 
Kl 



o 

nk 



l + Pk-n k 



(71) 



In this model we will not interpret the motion parameters P and Q as input but as states. 
This allows us to recover them with our observer but we must make some assumption about 
the dynamics of these vectors in order to formulate a state equation. In a general case we 

o o 

could say that Pk and <7& are the solution to some simple difference equation 



Pk+i = PPk and q k+1 = Qq k 



(72) 



which leaves the problem of finding the elements of P and Q. In the case of a moving 
camera these matrices are given by the dynamics of the actuating device (mobile robot, 
motion platform etc.). We believe that this interpretation is worth pursuing since it has 
many practical applications. Here we will content ourselves with an approximation. We 
assume that the motion vectors will not change significantly between frames because the 
sampling rate is high compared to the kinematic changes so that the matrices P and Q are 
null matrices. We then have a special form of (72): 



Pk+i = Pk 



and 



Qk+i = %■ 



(73) 



We can now summarize our state equations from (71), (73) and the implicit output 
equation from (64) 



Xfc-fl 



Pk+i 
. £*+i . 


= 





o 
l+P k -n k 
o 





. 



f(x*) 



(74) 
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g(*k, y k ) = g( 



n k 



Pk 

o 

Ik 



w h 



,° s k)= J2 J2l( q ^k ( i k + (n k -rk)Pk)-Sk] 2 = 0- (75) 

x=—w y=—h 



As we already mentioned, we can restrict ourselves to the imaginary components of the 
quaternions in the above relationships. 

5.3 A Nonlinear Observer for the Moving Plane 

It is clear that if we succeed in constructing an observer for this system we can recover 
depth and motion. However, the output equation is not of the desired form and depends 
on the image plane point at which the brightness gradients are measured. The solution to 
this problem is to modify the observer such that it can handle the implicit output equation. 

Let us recapitulate the basic idea behind the observer in section 2.7. The observer was 
basically a copy of the dynamical system under observation corrected by a matrix multiple 
of the error between actual and estimated output. In our case a discrepancy between the 
"real" state x and the estimated state x will result in a non-zero value for g(x,y). So we 
can simply use g(x,y) as an error term in the discrete version of (19): 



x k+1 = f(x k ) + eg(k k ,y k ). 
The estimation error e k = x^ — x k is found to be 

e k+1 = f(x fc ) - f(x fc ) + e(g(x k ,y k ) - g(x k ,y k )) 



(76) 



(77) 



where we have exploited g(x k , y k ) — to maintain symmetry. The nonlinearity of state and 
output however prevent us from applying the simple rules presented in section 2.7 for the 
calculation of e. Using the ideas of the extended Kalman filter we expand the non-linearities 
about the current estimate 



f(x fc ) = f(x k ) + 



dx 



x k 



(x k - x k ) + . . . « f(x k ) + A k (x k - x k ) 



(78) 



g(x k ,y k ) = g(x k ,y k ))+ — 



x kl y 



(x k -x k ) + ...& g(x k , y k )) + c k (x k - x k ) (79) 



which we substitute into equation (77): 

e k+1 = A k e k - e(c k ■ s k ) = (A k - ecf )e k . 



(80) 



The error difference equation is time-varying and in order to achieve a good dynamic 
behavior we must adjust e with every measurement, so we have e = e*. The idea is then, 
to compute A^ and c^ for every new estimate x k , then select e^ such that the eigenvalues 
of Afc - e k c k are located at the origin and finally compute the new estimate x k +i. The 
matrices A and c have been computed in the appendix A. We then have the following 
algorithm: 
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(1) Set initial values: 

Set Xo = (unless better information is available). 

(2) Observer loop: 

For frames k = 1 to n do 

(2.1) Compute system matrices: 

Compute Afc and c^ using the formulas (139) and (141) for the current estimate 
Xfc and measurement y k . 

(2.2) Compute observer gain: 

Compute the vector e^ such that all eigenvalues of A^ — e^c^ are zero. 

(2.3) Compute next estimate: 

Compute the next estimate x^+i from equation (76). 

Since we are not using a filter, the algorithm becomes simpler which is reflected in the 
block-diagram, figure (8). 
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Figure 8: A block diagram of the observer algorithm 
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5.4 Discussion 

As before, we would like to make a statement about the complexity of this observer 
algorithm. Here, however, we do not have m observers, one for each point of interest. But 
we have an image region of width 2w and height 2h over which we are sampling. If again we 
denote the number of frames by n, the algorithm will run in Q{whn) since one measurement 
for every image point must be processed to compute the vector c in (2.1) in every frame. 
Comparing the complexities is rather useless since the results and assumptions are quite 
different from those in the preceding algorithms. A parallel implementation that provides 
an efficient summation of the expressions in the components of c could reach a parallel 
complexity of 0(n). 

We also note that a closed form solution for the instantaneous problem (2 frames) using 
velocities t and u exists (cf. Negahdaripour and Horn [42]). It may be used to provide a 
good initial value for the observation algorithm. On the other hand, an observer algorithm 
as given above must compete with the repeated application of the instantaneous algorithm. 
Applying Negahdaripour's solution to n frames also requires Q(whn) operations. A com- 
parison of accuracy can take place once implementations of both schemes are available. 
In doing this, we must observe one important detail: The closed-form solution has been 
shown only for instantaneous velocities, not finite displacements and rotations as given in 
this section. If an iterative scheme is necessary to recover the latter from two frames, this 
would clearly increase the complexity. 
We summarize important properties of the above algorithm: 

(1) Incremental recovery of surface orientation and motion parameters. 

(2) Applicable only to planar surfaces. 

(3) Motion parameters are assumed to be constant over time. 

(4) The model uses finite translations and rotations represented by quaternions. 

(5) The discrete brightness change constraint equation is formulated and employed as an 
implicit output equation. 

(6) A nonlinear observer is presented to handle implicit output equations. 

From the theoretical point of view, this is the most appealing of the models presented in 
this paper as it recovers surface parameters rather than distinct depth values. On the other 
hand it only applies to the special case of a plane in motion. Perhaps a fusion of both will 
provide interesting results. 

6 Integrated Motion and Depth Estimation 

The previous examples have shown the duality between motion and depth: If we know 
the depth, we can recover the motion and vice versa. But if we try to solve both problems 
simultaneously we must impose constraints that restrict the solutions to rather specialized 



27 



cases. A conclusion might be to separate the two computations and feed one with the result 
of the other in an iterative scheme. 

We have seen further that the Kalman filter is not particularly well suited for motion 
estimation since no dynamical model of the behavior of the motion parameters is available. 
The idea in this section is therefore to estimate depth using the Kalman filter and update 
the motion estimate to fit the depth map in a least-squares sense in every iteration. 

The algorithms in this section make no assumptions about surface or motion other than 
the ones inherent to the BCCE and are therefore most likely to be successful on real imagery. 
For this reason we discuss all the necessary details in-depth as for instance measurement 
variances and pixel interpolation. We have tailored the scheme to two different implementa- 
tion environments: one in which only brightness values are available as measurements and 
the other in which a correlation- based displacement estimator is used. Conceptually both 
approaches follow the same idea but they have different efficiency and stability properties. 

6.1 Using brightness measurements directly 

The brightness change constraint equation is the foundation of this approach. We use 
it in its differential form 

— = E r • r + E t = (81) 

where r is the motion field from equation (26). If we substitute the motion field for r we 
obtain 

c + v«+-(s-t) = (82) 

after some rearrangement (refer to Negahdaripour and Horn [42] for the derivation). We 
have used the following abbreviations: c = E t , s = (E r X z) X r and v = -s X r. The reader 
will have noticed that we have again resorted to the continuous case. The reason for this 
lies in the fact that the brightness change constraint equation is simpler in this form and 
yields a closed-form solution for the motion when depth is known. 

We must also specify the dynamical system that will be used. It will be one-dimensional 
with the sole state-variable Z(t). The plant equation is simply the third component of the 
kinematic equation (25) i.e. 

Z = t z + (u> x y - u> y x)Z. (83) 

Z will also be our measurement so we have a very simple model. There is a slight problem 
with this formulation, however, which lies in the fact that x and y also vary in time and 
should be included in the state. We avoid this by interpreting them as time- variant param- 
eters that must be updated in every iteration according to the motion-field equation (26). 
What this means is that the depth prediction Z according to the above plant equation will 
be valid at the location specified by the new coordinates (x', y') which may not lie on a grid 
point of the image array. This leads to the problem of having to interpolate the depth as 
we mentioned in section 4. We discuss the treatment of these problems in detail below. 

Now consider the block diagram of figure (9). A new image arrives at iteration k and 
is fed into the depth estimator. Using the motion estimate from k — 1 we can solve the 
brightness change constraint equation (82) for Z at every image point and obtain a dense 
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Figure 9: Integrated depth and motion estimation based on brightness values 



depth measurement. This is used in the update stage of a Kalman filter to produce a depth 
estimate. Note that in this case the state variable being estimated and the measurement 
quantity are identical. The Kalman filter is merely used as an update algorithm. 

The updated depth map is used to compute the motion. We compute the motion such 
that the error in the brightness change constraint equation (the deviation from zero) is 
minimized in a least-squares sense over the region of interest. Finally we use the Kalman 
filter prediction to compute the expected depth map in the next measurement which is 
made available to the filter's update stage. We now discuss the various modules in detail. 

6.1.1 Depth Estimation 

Estimating depth proceeds in a straightforward manner based on the brightness change 
constraint equation (82). We solve for Z and obtain 



Z- — 



st 

C + V -U)' 



(84) 
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The quantities c, v and s depend on the spatial brightness gradients E x and E y and the 
temporal brightness gradient E t which can be approximated using finite differences of the 
image brightness values. The motion parameters t and lj are taken from the previous 
motion estimate. We thereby obtain a dense depth map. 

Since this Z will be the measurement quantity in our Kalman filter, we also need its 
variance. A closer look at (84) reveals that noise in the measurement of the brightness 
values will affect Z. Let us recall a basic result of probability theory. Suppose we have a 
random variable Z which is a function of a collection of random variables x\,...,x n so that 

Z = f(x u ...,x n ). (85) 

Using a Taylor series expansion of / around the "true" value of Z we find that 

»=i j=i * •? 

where cov(xi,xj) denotes the covariance of the two random variables. Now if we assume 
that the X{ are independent, which is usually the case in measurement processes, we have 
cov(xi,Xj) = for i jt j and cov(xi,Xj) = a Xi for i = j. We then have 

In our case the brightness measurements are interpreted as random variables and we 
are interested in determining how the variance in the brightness measurements will affect 
the variance of the depth estimate. We must find the immediate dependency of depth and 
brightness in equation (82). We substitute the expressions for c, v and s into (82) and find 

z _ E x (t x - xt z ) + E y (t y - yt z ) _ 

E t + E x (xyoj x - (1 + x 2 )uj y + yu> z ) + E y ((l + y 2 )u> x - xyu> y - xu z ) ' 

We abbreviate the coefficients of the brightness gradients to obtain a more compact expres- 
sion 

_ aE x + bE y , gQ , 

E t + cE x + dE y ' K ' 

The brightness gradients are estimated through some finite difference approximation as for 
instance the one suggested by Horn and Schunck [27] : 

E x « -^(E(x + d x ,y,t)- E(x,y,t) + E(x + d x ,y + d y ,t) - E(x,y + d y ,t)+ 
E(x + d x , y ,t + T)- E(x, y ,t + T) + E(x + d x ,y + d y , t + T)- 
E(x,y + d y ,t + T)) 

E y w ^-(E(x,y + d y ,t) - E(x,y,t) + E(x + d x ,y + d y ,t)- E(x + d x ,y,t)+ 

E(x, y + d y ,t+T)- E(x, y,t + T) + E(x + d x ,y + d y , t + T)- (90) 

E(x + d x ,y,t + T)) 
E t « j r (E(x,y,t + T)-E(x,y,t) + E(x,y + d y ,t + T)-E(x,y + d y ,t)+ 
E(x + d x , y,t + T)- E(x + d x , y, t) + E(x + d x ,y + d y , t + T)- 
E{x + d x ,y + d y ,t)). 
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We have denoted the distance between two pixels in the x direction by d x , in the y direction 
by d y and the inverse frame rate by T. Note that this approximation is the average over 4 
first differences along the edges of a cube in spatio-temporal hyperspace. The approximation 
is valid at the center of that cube. 

We can apply our formula (87) for the propagation of variances to calculate the variance 
in Z. Note, however, that the E x , E y and E t are not independent and can therefore not 
be used. We must express Z in terms of the Ei (the brightness values used in the gradi- 
ent approximations (90)) which we may assume to be independent identically distributed 
random variables with variance o\. We then find the variance in Z to be 

which evaluates to 

1 (aE t + (ad - bc)E y f j? + (bE t + (bc- ad)E x f £■ + (aE x + bE y ) 2 fr 

° z = 2°* " (E t + cE x + dE y y ' * (92) 

We have determined how the variance in the depth is related to the variance in the 
measurements. But what is the variance in the measurements? One component of the 
measurement noise can be modeled: the quantization noise. Since our sensor discretizes 
brightness values into gray-levels, values that do not coincide with a discretization step will 
be "rounded off" to the next step. In our model of the sensor, the discrete sensor output 
values will be denoted by Ei and the constant quantization step by AE = E{+\ — Ei. If we 
assume that our sensor discretizes in the following fashion 



\ Ei 



JP _ } ^{+1 if Eactual > Ei + AE/2 , „s 

iWor-i „. . f Eactual < E . + AE/2 W 



and that brightness values are equally likely in [Ei,Ei+i] the expected sensor output is 
easily determined as (JSj+i + Ei)/2 and the variance as 

*-^. («) 

However, this models only one possible source of noiseand distortion. Others include the 
electrical circuitry in the sensor or simply blurring or defocusing. So instead of attempting 
to model all these various noise sources, we can simply conduct a measurement in which 
a uniformly colored simple surface is viewed to provide a set of measurements E\, . . . ,E n . 
We then determine the expected value 

£(£)=if> (95) 

and the variance 



n . 

!=1 



2 =rE(^-£(£)) 2 - (96) 



8=1 



Using these results we can obtain a dense depth map where every depth value is accompa- 
nied by a variance that will be used by the filter we describe below. 
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6.1.2 Kalman Filter Update of Depth Map 

In order to update our estimate Zk we must first compute the filter gain from (21). 
Since the system is one-dimensional, this is a scalar 

e* = ft (ft +'*)- 1 = -=*£- ( 97 ) 

Pk "t" 7 * 

where r^ is the variance of the measurement. We are measuring the depth so rk = o\ 
which was computed in the previous subsection. 

The filter update equations from (21) are also extremely simple: 

Zt = Z;+e k {Zk-Z^) (98) 

p+ = {l-e k ) P - k . (99) 

We now have a new depth map and also a covariance map. The former is used in the 
motion estimator described below. 

6.1.3 Motion Estimation 

This module must solve the following problem: given the depth at every image point, 
what is the global relative motion t, u between camera and environment. Under the 
assumption of the validity of the brightness change constraint equation, this problem has 
been solved [29]. Assuming that the region of interest has width 2w and height 2h centered 
at the origin, we are seeking a motion estimate t and u that minimizes 

w h i 

E E( c + v ^+^( s -t)) 2 - ( 10 °) 

X= — W y— — h 

Differentiating with respect to t and u and equating to results in 

w h i w h -i w h 

( E E (- r )|v + ( E E (« T )^)t = -E Y,r 

X=—Wy=z—h X=—Wy= — h X= — Wy= — h 

w h w h -i w h 

(E E(w T )v + (E E(*v T )j)t = -YE™ 

x=—wy= — h x=—wy= — h x=—wyz=—h 

a 6x6 system that can be solved for the desired parameters in general. 

6.1.4 Kalman Filter Prediction of Depth Map 

Given the current estimate for the motion parameters and the updated estimate for the 
depth map the prediction phase of the Kalman filter produces a predicted depth map for 
the following iteration. Looking at the Kalman filter (21) we see that we need predictions 
for state and covariance. A simple first differences approximation for our plant (83) yields 
the following discrete form 

Z k+1 = Z k + T{t z + (u x y - u y x)Z k ) (102) 
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(101) 



which is the prediction equation for the depth. We can see that the system "matrix" is 

4> k = 1 + T(u x y - u y x). (103) 

It is needed in the prediction equation for the covariance which reads 

Pl +l = [1 + T(u x y - uyx)] 2 p+. (104) 

As we have argued previously we have assumed that the plant equation of the system is 
not noise corrupted as it is purely kinematic. Therefore q k = 0. 

This procedure, however, is not quite correct. We have neglected the fact that the new 
estimate Z^ +1 will not be valid at the same image plane location (x,y) at which Z% was 
stored since x and y change over time. To make things worse, the position at which our 
estimate will be valid may not coincide with a grid point of the image array. So we must 
determine where the point at which we are estimating the depth moves to and interpolate 
the depth at the grid points from this transformed depth map. 




Figure 10: Displacement of observed points between successive images 



The movement of a point (x,y) in the image plane is described by the motion-field 
equation (26). A finite differences discretization yields 



y > = y + T( ~ ty + ytz - u y xy + u x (y 2 + 1) - u z x) 



(105) 
(106) 
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which provides us with the coordinates (x 1 , y') at which our predicted depth Z^ +1 is valid. 
We can think of this as a new depth map which evolves from the old one by the prediction 
process as depicted in figure 10. The remaining problem is to interpolate the depth and the 
variance at the grid points of the image array. We investigate two methods for accomplishing 
this in which we focus on the depth interpolation. 

The first method simply fits a plane to the data in a least-squares fashion. Supposing 
the depth values Z{ at coordinates (x t -,y t ) for i = 1,. . . ,n are within some neighborhood 
of the grid-point we are considering. We locally approximate the real-world surface by a 
plane 

aX + bY + cZ = 1 (107) 

and using the perspective projection equations we have 

axZ + byZ + cZ = 1. (108) 

We seek to determine the parameters a, b and c such that the deviation of our depth values 
from the resulting plane 

n 

^2(axiZi + byiZi + cZi - l) 2 (109) 

t=i 

is minimized. The minimum can be found by differentiating and equating to zero. This 
yields the following 3x3 system 

n n n n 

a^ixiZif + b^XiyiZf + c£>,Z 2 = £>,£; 

j=l i=l t'=l i=l 

n n n n 

a^XiViZ? + bJ^iViZi) 2 + cJ^ViZf = £^i (no) 

8=1 »'=1 »'=1 i=l 

«X>Z 2 + bJ2viZf + cjzz? = x> 

i=l t'=l i=l t=l 

which can be solved for a, b and c. Then the depth value at the grid-point xo, yo is easily 
computed as 

z = Tir-r-- ( 1U ) 

ax + by + c 

The method is rather intensive in terms of computational cost. We must compute 9 
different sums over all of the n measurement values and solve a 3x3 linear system for ever 
interpolation point. We only obtain an interpolation for a planar approximation. 

An alternative that is more computationally oriented is to compute the grid-point value 
as some weighted average of the estimates in its neighborhood 

Z(x',y , ) = £w i Z(x i ,yi). (112) 

j=i 
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We have set up some criteria for the choice of the weighting function w,- which can be found 
in appendix B. We show there that 

1 

m = -#-. (113) 

t=i * 

fulfills a set of natural requirements that one would demand of such a weighted average. 
The appendix also contains a linear-time algorithm for computing the weights and the 
average so that the resulting scheme is very efficient. 

Other interpolation schemes include bi-linear and bi-cubic interpolation. See Rifman 
and McKinnon [46], Bernstein [4] or Abdou and Wong [1] for some interesting and practi- 
cally valuable techniques. 

6.1.5 Discussion 

Our scheme for integrated motion and depth estimation is now complete. The charac- 
teristics and deficiencies of this scheme are summarized below. 

(1) We recover a dense depth map using a Kalman filter and a motion estimate to fit the 
depth map in a least-squares sense. 

(2) The validity of the brightness change constraint equation is assumed. No other as- 
sumptions about surface, reflectance or motion are made. 

(3) We need no optical flow or displacement estimator - the algorithm operates directly 
on brightness values. 

(4) Estimates are based on gradient approximations that tend to have little numerical 
robustness. 

(5) Depth and motion estimates rely on the same physical relationship. This may lead 
to instability of the iterative estimation. 

Item (4) is an inherent property of all schemes that rely on the differential form of the 
brightness change constraint equation. An alternative is to consider an equivalent inte- 
gral constraint to obtain higher robustness. The 6th item reflects the "chicken-and-egg" 
problem that arises when we sequence motion and depth estimation rather than recover 
them simultaneously. Since the latter cannot be achieved in general, we rely on the filter 
to provide sufficient convergence to render the effect of this problem negligible. We address 
these concerns by slightly altering the implementation as described below. 

6.2 Using SSD-based displacement estimates as measurements 

This section investigates an alternative measurement procedure to the one based on 
the differential form of the brightness change constraint equation described in the previous 
section. In order to remedy the problems with the differential approach (refer to items 
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5 and 6 in the discussion) we require two main properties of an alternative formulation: 
measurements should be based on information from a larger image area (an integral rather 
than a differential approach) and the measurement process should be independent of the 
motion estimation. 
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Figure 11: Integrated depth and motion estimation based on correlation-based displacement 
measurements 



Figure (11) has basically the same structure as the one in the previous subsection. We 
have replaced the depth estimator by an sum-of-squared-differences (SSD) displacement 
estimator. The output of this system is a vector r^ = (xk,Vk) for every image point which 
describes where a given pixel in the previous image has moved to in the current image. 
Given these measurements we are forced to change the structure of our dynamic system 
to produce these quantities as output. We resort to our initial formulation of the object 
kinematics and have chosen the discrete formulation (30) for the sake of variety. We have 



O O 0* 

Rk+i =Pk + 1kRk<lk 



(114) 
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where R denotes the real world position of a point on the surface being viewed and P, q 
represent translation and rotation. The components of our plant are therefore 

Xk+i = Px + q22X k + q23Y k + q2*Z k (H5) 

Yk+i = Py + q?aX k + q33Yk + qz\Z k (116) 

Zk+i = Pz + q42X k + q&Yk + 944^fc (I 17 ) 
and the output is simply 

x k = X k /Z k (118) 

Vk = Y h IZ k . (119) 

The qij in the above equations are the elements of the matrix in (29) introduced for the 
quaternion representation of rotations. We see that we are trading off higher accuracy for 
complexity of the system. 

6.2.1 SSD-based displacement estimation 

Displacement estimation is related to optical flow estimation in that displacements are 
the product of the instantaneous velocity and the inverse frame rate. While optical flow 
estimation has proceeded along the line suggested by Horn and Schunck [27], displacement 
estimation using correlation- based methods or sums-of-squared-differences has proven to 
be effective. Comparisons of typical correlation-based matchers can be found in Hannah 
[17] and Burt, Yen and Xu [9]. An in-depth study of the applications of these techniques 
to the estimation of displacement fields in motion sequences is presented by Anandan [2]. 
In particular, correlation- based estimation has proven to be useful when large interframe 
displacements occur. On the other hand, the method encounters difficulties with foreshort- 
ening. 

We employ the sum-of-squared-differences (SSD) technique which we briefly describe 
below and in figure (12). Suppose we wish to determine where the image of a point in the 
real world that was located at coordinates (x, y) in frame t has moved to in frame t + T. 
Let the new coordinates be (x',y') = (x + Ax,y + Ay). We assume that the brightness 
in a neighborhood of the point of interest does not change significantly from one image to 
the next (refer to the brightness change constraint assumption). To obtain a measure of 
the quality of a given displacement estimate (Ax, Ay) we can therefore sum the squares of 
the differences in the brightness values of corresponding points in the neighborhoods of the 
original and displaced image point. More formally if E(x, y, t) denotes the image brightness 
at location (x,y) in frame t we have 

&(Ax,Ay)= Y,[ E ( x ,y' t )- E ( x + Ax >y + A y> t + T )] 2 ( 12 °) 

x,yeP 

as a measure for the error in a displacement estimate (Ax, Ay). P is used as a symbol 
for a set of image points constituting the neighborhood to be considered in the correlation. 
The displacement estimator minimizes the above error to produce (Ax*, Ay*) such that 

0(Ax*,Aj/*)= min ®(Ax,Ay). (121) 

Ax,AyeD 
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Neighborhood P 



Figure 12: SSD displacement estimation 



In a practical implementation such as the one by Little [33] which we intend to use in 
our scheme, ®(Ax, Ay) is computed for some set of displacements D in frame t + T. The 
resulting set of error values is searched for a minimum to yield (Ax*, Ay*). 

In order to utilize such an estimate in a Kalman filter we need a quantification of 
its variance or more precisely the covariance matrix for [x', y'] T which is the same as for 
[Ax,Ay] T : 

cov(Ax,Ay) 



R = 



*L 



cov(Ax,Ay) 



'Ay 



(122) 



We may assume that estimation error in the x- and ^-directions are independent so that 
cov(Ax, Ay) = 0. The remaining variances are due to the variance in the brightness 
measurement as described in the previous section. However, the measurement procedure 
itself contributes to the uncertainty. 

Anandan [2] suggested a "confidence measure" which was justified intuitively in the 
following way. The residual error ®(Ax*, Ay*) is one component of the confidence measure: 
the higher it is the lower our confidence in the measure. The second component is due to 
the fact that there must be significant variation of brightness within the neighborhood P 
in order for the correlation scheme to be able to identify the neighborhood in the next 
image. In other words points in areas of uniform brightness are impossible to localize in 
successive images. The notion of variation of brightness throughout the patch is captured 
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by the curvature of the SSD function i.e. its second derivative. The confidence measure 
was therefore essentially the second derivative of divided by the residual error in the 
optimum Q(Ax*,Ay*). 

Unfortunately, there is no formal concept of a confidence measure and hence no deriva- 
tion to justify it exists - it is merely intuitive. It is also intuitive that a variance should be 
the "inverse" of such a confidence measure although the connection to probabilistic mea- 
sures is not clear either. We have found a formal derivation a\ x and a\ y which verify 
Anandan's conjecture. The derivation is rather lengthy and is given in appendix C. The 
derivation results in the following values for the variances 



„2 
°Ax 



_2 
a Ay 



= 8a 



= 8a 



Eq2 02 _ Q4 

^xx^yy ^ xy 

E Q2 02 _©4 ■ 

^xx^yy xy 



(123) 
(124) 



Notice that the denominators contain the Hessian of which indicates the curvature while 
the numerators contain the residual error 0. These are the components of the confidence 
measure suggested by Anandan. The above variances are needed in the Kalman Filter 
update process described below. 



6.2.2 Kalman Filter Update 

Before the actual update can take place we must first compute the filter gain. From 
(21) we have 

E fc = P^CfctC^C* + iU]- 1 (125) 



where 



C k = 



is the Jacobian of the output equation and 



R* = 



k ° 

d 









°ly . 



(126) 



(127) 



is the measurement covariance computed in the previous subsection. We see the increased 
computational cost here as we must invert a two-by-two matrix for every pixel whereas the 
previous scheme involved only a scalar inverse. 
Now the filter update proceeds as usual 



R fc = R it + E *( 



P+ = (I-E fc C*)Pfc 



x 



zz 



(128) 
(129) 
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where x' and y' are the outputs of the displacement estimator. Also note that due to the 
fact that we are now measuring displacements (i.e. in a sense the first two components 
of the state vector) rather than the depth, we must perform the depth map interpolation 
after the update instead of after the prediction. The predicted output is compared to the 
displacement estimate in the update equation above. Therefore it is not necessary that the 
depth values are valid at grid-points after the prediction stage. We will, however, perform 
the grid-point interpolation after the update to enable motion estimation and prediction to 
operate in the usual fashion. The interpolation proceeds as described in subsection 6.1.4. 



6.2.3 Motion Estimation 

In this module we recover the motion that best accomplishes the transformation between 
frames k — 1 and k or rather between our estimates of the state variables at these times. If 
we denote the image region under consideration by P then we seek the motion parameters 
Px,Py,Pz and qij for i,j = 2,3,4 such that 



£p[ (X k ~Px + Q22Xk-l + q23Yk-l + ?24^fc-i) 2 + 
(Y k ~Py + Q32Xk-l + 933^-1 + qS4Z k -l) 2 + 

(Z k -p z + q42Xk-i + 943^-1 + quYk-i) 2 ] 



(130) 



is minimized. 

The necessary conditions for a minimum are the result of differentiating the above 
expression with respect to all of the motion parameters and equating to zero. This yields 
12 linear equations for the parameters. But since the above error sum contains three 
independent terms, we really have 3 four-by-four systems to solve. They have the form 



EpI 
J2p Xk-i 

ZpYk-i 

Ep Zfc-i 


Y^pXk-i 
Ep ^ifc-i 

J2p Xk-\Yk-\ 

YlpXk-\Zk-\ 


Epn-i 

JlpXk-iYk-i 

Ep Yjt-i 

HpYk-\Zk-\ 


Ep Zfc-i 

Y2pXk-iZk- 

HpYk-iZk-i 

Ep z\-\ 


1 

L 

'E 
E 
E 
E 


pi 

Qi+1,2 

Qi+1,3 

. 9i+l,4 . 

P Ri,k 

p Xk-iRi,k 
pYk-\R%,k 
p Zk-iRi,k 



(131) 



for i = 1,2,3. We have used the notation pi to denote the components of the translation 
quaternion and Ri t k to denote the components of Rfc, i.e. Xk, Yk and Zk- This somewhat 
alleviates the computational burden. 

To complete the motion estimate, we must ensure that the matrix 

922 923 924 
Q = 932 933 934 (132) 

942 943 944 

is orthonormal or in other words that the rotation quaternion has unit magnitude. We seek 
the nearest orthonormal matrix to Q which is 



Q = Q(Q T Q)- 1 / 2 = Q( v / ATuiuf + v/A^uJ + ^u 3 u^) 



(133) 
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where A» denotes the ith eigenvalue of Q and u,- the corresponding unit eigenvector. This 
again involves quite an amount of computation. 

6.2.4 Kalman Filter Prediction 

We note that the Jacobian of our dynamical system $& is simply the matrix Q fc from 
the previous subsection. The prediction equations become 

Rfc+i = Pjb + QfcRlt ( 134 ) 

Pfc + i = Qk^tQl ( 135 ) 

where we have once again assumed that the plant noise is zero. 

In the block-diagram we have fed the state estimate back into the displacement esti- 
mator. The idea is to use this estimate to limit the search the SSD-matcher must do. In 
the description of the displacement estimator we denoted the region of displacements the 
estimator would consider by D. This region will now be a neighborhood of the predicted 
displacement (X^ /Z£ ,Y£~ /Z£) which limits the amount of search considerably. 

6.2.5 Discussion 

This scheme is well suited for alleviating the problems of the previous one namely to 
decouple motion and depth estimation. The displacement estimator no longer requires the 
motion estimate for its operation. In addition our measurement process is no longer based 
on gradient approximations but rather draws information from a region of the image plane. 
We expect this to result in greater robustness and a higher rate of convergence. 

The key features of the approach presented above are summarized here: 

(1) An SSD-estimator is used to measure displacements. 

(2) A discrete object motion model using quaternions is employed. 

(3) An Extended Kalman Filter is used to estimate real-world object coordinates. 

(4) The SSD-estimator is based on a brightness constancy assumption. No additional 
assumptions about surface, reflectance or motion are made. 

(5) A dense depth map and discrete motion parameters are recovered. 

(4) The scheme is computationally quite intensive. Among the time consuming operations 
are: searching for the optimal displacement, matrix inversions for the gain compu- 
tation, solving three 4x4 systems for motion estimation, renormalizing the rotation 
matrix. 

The high computational cost may be the main deficieny of this scheme. We intend to 
use the Connection Machine for the implementation of the algorithm in order to achieve 
near-real-time performance. 
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It is worth pointing out some essential differences between the algorithm we have sug- 
gested in this section and the work of Matthies, Szeliski and Kanade [37]. In both cases a 
Kalman Filter is used to predict and update a dense depth and variance map. However, 
the latter approach is restricted to translational motion parallel to the image plane. In 
addition, motion must be known at every point in time. We have imposed no restriction 
on motion and have further shown how motion estimation can be incorporated into the 
iterative filtering process. This makes the our approach applicable to a far more general 
class of imaging situations. 

7 Conclusion 

We have seen three conceptually different schemes for modelling the imaging situation as 
a dynamical system and using a state space observer/ filter to recover parameters of interest 
such as depth and motion. The second algorithm presented differs structurally from the 
other two in that it recovers surface parameters instead of a dense depth map. Of course this 
requires a parametrization of the surface which is usually unable to handle discontinuities. 
But the parameter- based approach may have higher robustness and be well suited for 
specific applications. One such application is the landing approach of an airplane in which 
we may approximate the runway area as a plane and seek to recover the relative orientation 
of aircraft and runway. The parameter-based approach can naturally be extended to more 
complex surface structures such as quadratic patches and more sophisticated motion models 
such as constant acceleration. Another interesting idea is to approximate locally a complex 
surface by planar patches and apply the ideas to those patches. But before elaborating 
either of the models, they should prove their capabilities in an implementation. 

The first algorithm presented in this paper attempted to model the correct physical 
relationships involved in the formation of brightness values. We saw that this forced us 
to make very strong assumptions about the reflectance properties of the surface and the 
illumination conditions. Most real images will not satisfy these conditions and the algorithm 
is expected to operate accurately only on synthetic data. 

The two variants of integrated motion and depth-estimation presented last are most 
promising for application to real images. They show how the Kalman filter can be in- 
corporated into existing motion vision schemes to achieve incremental depth estimation 
over more than 2 frames. We also noted that a tradeoff between accuracy/robustness and 
computation expense is involved of which the presented alternative implementations are 
good examples. The Kalman filter proved useful for the depth estimation process because 
this quantity could be interpreted as the state variable of a dynamical system. Motion 
estimation, however, is not possible without some dynamics model of the actuating device. 
An interesting idea would be to incorporate the dynamical model of a mobile robot, which 
carries the camera into the plant equations, and apply the Kalman filter for simultaneous 
depth and motion estimation. 

The depth map approaches also reveal why the application of filtering theory has pre- 
viously focused on feature matching: displacements of features are known rather precisely 
and provide a good measurement for the filter. On the other hand they have the deficiencies 
discussed earlier which led us to consider non-feature based algorithms. 
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It is important to mention that the dynamical systems approach is not limited to the 
application of filters for depth and motion estimation. There are two other applications 
that come to mind. What for instance would it mean to close the feedback loop over one 
of the dynamical systems presented above? It means that we measure some indication of 
motion in the image plane and feed it back to control the motion of the camera. This could 
be used in tracking or some other orientation procedure of the camera carrier such as in 
mobile robots or aircrafts and satellites. 

Another powerful technique that one could import from control theory is system iden- 
tification. Sophisticated techniques have been developed to estimate the parameters of 
dynamical systems which are the constants in the plant and output equations. What does 
it mean to perform system identification on one of our models? The model in section 4 
which is based on brightness values contains the diameter of the camera lens and all models 
implicitly contain the focal length. We can use one of these identification procedures to 
measure these quantities. Identification may also aid in dealing with more complex motion 
models as the ones presented in equation (72). In this case we would have to perform online 
identification i.e. identify the parameters while the observer estimation is being performed. 
This may lead to interesting and powerful models. 

The main objective in this paper was to introduce a systematic way of dealing with a 
series of frames in motion vision. We have shown that dynamical systems provide a way 
for dealing explicitly with time dependency and have formulated the solution to a common 
motion vision problem in terms of such systems. It is conceivable that is approach can be 
extended to other problems in which information is acquired over a series of frames such 
as multiframe edge-detection, segmentation, color, texture etc. Unlike previous approaches 
mentioned in the introduction we have attempted to found our models rigorously on physical 
facts only. We intend to carry out experiments to corroborate our theoretical results. 

Acknowledgments 

Berthold Horn encouraged me to look at the problem of filter application for the mov- 
ing plane and using the BCCE as a measurement equation. Tommy Poggio, Hans-Hellmut 
Nagel and Ellen Hildreth made valuable comments on earlier versions of the paper, pro- 
vided me with excellent references on the subject and ensured that the applicability of the 
algorithms to real images remained the most important concern. I would also like to thank 
my friends Sundar Narasimhan and Brian Subirana for the inspiring conversations we had 
about some of the problems in this paper. 



43 



Appendix 

A System Matrices for the BCCE Observer 

In section 5.3 we found that the matrix 

8f 



Afc= 0x 



(136) 
x fc 



and the vector 

r - 9g 



(137) 

x*,y fc 

are necessary to compute the observer gain vector e*;. The vector function f(xfc) is given 
by (74), the implicit output 0(xfc,y fc ) by (75). In this appendix we give the elements of 
the above matrices for convenient computation. We omit the hat representing estimated 
values and the index k of time. We use the following abbreviations: 

• The surface normal n = [n x ,n y ,n z ] T 

• The translational displacement p = \p x ,Py,Pz] T 

• The position vector of an image plane point r = [x, y, 1] T 

• The brightness gradients E x , E y and E t 

o 

• A vector x has a corresponding purely imaginary quaternon x. 

• The rotation 9 is a unit quaternion so q% = 1 — q x — q y — q% 
The elements of A are: 

_ J,„2 _2 „2 , 1 + n y p y + n z p z 
an - q + q x -q y ~q,+ (1 + n . p) 2 

o<„ n „ „\ nxp y 

ai2 = 2{q x q y - q q z ) - + p , 2 
ai3 = 2(q x q z + q q y ) - "*** 



nl 



«14 



(l + n-p)2 
ai5 = -; 



«16 



(l + n.p)2 
(l + n-p)2 



a 17 = 2n y (q y + ^) + 2n z (q z -^) 

qo go 
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Q 2 

a 18 = -4n x q y + 2n y (q x + -2-^-) + 2n z (q -) 

9o lo 

2 

ai9 = -4n x q z + 2n y (-q + — ) + 2n 2 (# r — ) 

go Qo 



«21 

«22 = 

a 2 3 = 

a 24 = 

025 = 

«26 = 

a 2 7 = 

«28 = 

029 = 



= %*?y + 9og*) 



't'yVx 



(l + n-p) 2 



2 2 i 2 2 , J- H - n xPx + n zPz 

1o-1* + 1y-<l.+ (l + n . p) 2 

2(g, 9 , - «,&) - (1 + % n p ; p)2 

~(l + n-p) 2 

«2 



(l + n-p) 2 
(l + n-p) 2 



2n x (q y ) - 4n y q x + 2n z (-q + — ) 

9o <7o 

2n,( fc - Mi) + 2 „,( g , + *&.) 

qo qo 

2 

2n x (q - ^5.) - 4n y ?, + 2n 2 ( 9j/ - 2f*» ) 
<7o <Zo 



(138) 



«31 

a32 = 

033 = 

azi = 

^35 = 

«36 = 

«37 = 

«38 = 

039 = 



= 2(g r g, - q q y ) - 



n z p x 



(l + n-p) 2 
2(?»?, + <?ofe) - ( 1 + n . p )2 

q -q x -q y + q z + (1 + n . p)2 



n E n 


z 


(1 + n- 

n y n 


•P) 2 

z 


(1 + n- 


•p) 2 


(1 + n. 


p) 2 



2n,(g, + 222i) + 2n y (q - ^) - 4n 2<?;E 
9o 9o 

2 

2n x (-g + ^) + 2n y (q z - Ml) _ 4 n^ y 
<?o 9o 

2n x (q x + - L - i ) + 2n y (q y — ) 

qo qo 
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ajj ; = fori = 4,..., 9 j = l,...,9 
The elements of c are: 

w h 



(139) 



x=— w y=— h 
w h 

x=— w y=— h 
w h 

<* = 2E E 

X— — Wy=z — h 

w h 

C4 = 2^ E 

^2EE 

i=-«i]/=-ft 

x= — Wy= — h 
w h 

<* = 2E E 
<* = 2£ £ 

C9 = 2E E 



o o* o o o vn/ x 

(?r? + P(n • r)](s • p)x 



0* o o.., . 

(qrq + p(n • r)](s • p)y 



OoO* o 0,,, v 

(qrq + p(n ■ r)](s • p) 



o o* o o o. n/ . 

(grg + P(n • r)](n • r)^ 



o o* o o 0. ., . _ 

(grg + p(n • r)](n ■ r)E y 



,0 o* 0. 



(140) 



(grg + P(n • r)](n • r)(£ t - xE x - yE y ) 
(grg* + p(n ■ ?)](2y£„(g, + Mi) + 2 E t (q 2 



qo 



qo 



-)) 



(grg* + P(n • r)}(2xE x (q x - M» ) + 2 ^(g, + ^)) 

go go 



(qrq + p(n • r)](2^ I ( 9x + Mi) + 2 *i^(g y - Mi)) 

go go 



B An Efficient Pixel Interpolation Scheme 

When computing Kalman filter updates or predictions of depth and variance maps the 
problem arises that the updated value is no longer valid at a grid-point of the pixel-array 
because the projection of the point under observation has moved. The proposed solution 
in sections 6.1.4 and 6.2.2 was to reinterpolate the grid-point values. 

Here, we give a computationally efficient solution for the following problem: Given n 
points (xi,yi) and a function value Z(xi,yi) at those points interpolate Z at a point (x,y). 

We will compute the interpolated function value as a weighted sum 



Z(x',y') = J2 w i z ( x i'Vi)- 
«=i 

A good weighting function should fulfill the following requirements: 
• < to, < 1 



(141) 
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• E?=i wi = 1 

• The W{ should decrease as the distance d 2 = (a;,- — x') 2 + (yi - y') 2 decreases. 

We therefore choose the weighting factors to be 

l 

m = -%-. (142) 

^2d7 

This clearly fulfills all of the requirements but some special cases must be considered. 
Suppose all n estimates involved in the interpolation have equal distance di = d from the 
grid-point. In this case 

««=-£-=£ (143) 

Ei n 

which means that all estimates are weighted equally as one would expect. A more tricky 
case occurs when some number k of the estimates are actually located at the grid-point, 
i.e. di — for i = 1,. . . ,k. For estimates on the grid-point we can rearrange the expression 
for the weights (142) to obtain 

Wi = ► r (144) 

J=fc+1 

as di — ► for i = 1, . . . , k. Similarly we rearrange the expression for the weights in the case 
of an estimate that does not coincide with the grid-point 

I 

} - (145) 



* i=k+i ' 

as di — > for i = 1, . ..,k. In other words we ignore all estimates that are not on the 
grid-point and obtain the interpolated value as the mean of the estimates located at the 
grid-point. 

This may be very easily encoded into an efficient 0(n) algorithm which we give in the 
following pseudocode notation: 

k^0;W ^0; 

For i <— 1,.. . ,n 

d 2 ^(xi-x') 2 + (yi-y') 2 ; 

if df = then k <- k + 1 else W <- W + 4; 
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if k > then 



For i <— 1, .. . , n 

if d| = then w,- <— % else w,- <— 0; 



else 



For i *— l,...,n 

When dealing with real estimates they will rarely fall directly onto the grid-point and 
we may wish to replace the test d] — in the above algorithm with d? < e where € is chosen 
according to the numerical accuracy of our processor. 

C Variance of the SSD-Displacement Estimator 

In this appendix we give the derivation of the variances a\ x and a\ y of an SSD dis- 
placement estimate. Recall that in section 6.2.1 we have introduced the SSD functional 
120 

Q(Ax,Ay)= ]£[£(x,y,*)-£(x + Az,2/+At/,i + T)] 2 (146) 

x,yeP 

which quantified the quality of a match between point (x,y) in frame t and point (x + 
Ace, y + Ay) in frame t + T. The optimal displacement (Ax*, Ay*) was defined by 

Q(Ax*,Ay*) = min &(Ax,Ay). (147) 

Ax,AytD 

In the following we abbreviate the second partial derivatives of with respect to Aa; 
and Ay by 0^ and 0j, and the second partials by 0^, Q xy and Q yy . 

The displacement estimate is the result of minimizing (146). Differentiating this equa- 
tion with respect to Ax and Ay yields the necessary conditions for the minimum: 



BE 

e x (Ax, Ay) = -2 £ [E(x, y, t) - E(x + Ax,y + Ay, t + T)]-^ 

x,ytP 

Q y (Ax, Ay) = -2^2 l E ( x > V, - E ( x + Ax > V + A V, t + T )\ y 

x,ytP 



(148) 

(x+Ax,y+Ay,t+T) 

(149) 
(x+Ax,y+Ay,t+T) 



must both be equal to 0. Assuming that the variance in a measurement of E is a\ and in 
a gradient measurement of |^ is 2a^/d x (by applying the variance-propagation (87) to the 
gradient approximations (90) above) we can determine the variance in 0^ and © y . 

We first introduce two new sets of random variables A x (x,y,t) and A y {x,y,t) where 



A x (x,y,t) = [E(x,y,t)-E(x + Ax,y + Ay,t + T)] — 

dE 

A y (x,y,t) = [E(x,y,t)-E(x + Ax,y + Ay,t + T)] — 
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(150) 

{x+Ax,y+Ay,t+T) 

(151) 

(x+Ax,y+Ay,t+T) 



which are simply the terms being summed in (148) and (149). Using the formula for the 
propagation of variances of independent random variables (87) we easily see that 



< = 2 "i«g 



(x+Ax,y+Ay,t+T) 



) 2 + 



A 



(E(x,y,t) - E(x + Ax,y + Ay,t + T)) 2 ] 



< = 2 <^<f 



(x+Ax,y+Ay,t+T) 



) 2 + 



-$(E(x, y, t) - E(x + Ax,y + Ay, t + T)) 2 ]. 
a y 

With these abbreviations we can write 

x,ytP 

Q y = - 2 Y, A>0>y>*)- 

x,ytP 

Again we make use of the variance propagation formula (87) to obtain 



(152) 

(153) 

(154) 
(155) 



BE 



<= 4£-L= 84(E(£ 



x,ycP 



x,yeP 



Bx 



(x+Ax,y+Ay,t+T) 



) 2 + 



^ £ (E(x, y,t)-E(x + Ax,y + Ay,t + T)) 2 ) 

x x,yeP 



(156) 



<= 4E<= s4(E( 



x,ycP 



x,yeP 



BE_ 
By 



(x+Ax,y+Ay,t+T) 



f + 



^ J2 (E(x, y, t) - E(x + Ax,y + Ay, t + T)f) 

V x,ytP 



(157) 



where we recognize the second sum in each of the expressions to be 0. 

On the other hand we can use the variance-propagation (87) on (146) to determine the 
immediate dependency 

4. = ©L'L + o^iy ( 158 ) 

4, = Qlyrtx + QlyOAy (159) 

It remains to substitute the expressions for <7q x and a\ from (157) into the above equations 
and solve for the desired variances a\ x and a\ y of the displacement components. 

_2 _ Q iw <T e x - & *y< T & y _ 8<t| fQ2 (Q > Sp ( BE \2\_ 

a Ax - ©L0L-0L " 0L©L-©iJ°w^ + H?Z,-.A-...*..*.n) * 



'«i"S9 ^xy 



Q ly(§+ £(§f 



x,ycP 



x,ycP 



(x+Ax,y+Ay,t+T) 



(x+Ax,y+Ay,t+T) 
) 2 )] 



(160) 
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(x+Ax^+Aj/.i+Ty 



x,ytP 

(161) 
We can obtain an expression in analogy to the confidence measure suggested by Anandan 
by observing that the pixel distances d x and d y are small quantities so that the parts of 
the numerators which have a coefficient \jd\ or \jd\ will tend to dominate the remaining 
terms. The expressions for the variance are then 

«i, = 84 ea '' ea - e < (162) 

«i, = *r| *» '_' (163) 

xx yy %y 

in which we recognize the denominators to contain the Hessian of which indicates the 
curvature. All expressions containing must of course be evaluated for the estimated 
displacement (Ax*, Ay*). In practice it may be too tedious to approximate all the necessary 
derivatives of when computing the variances so that further simplifications are necessary. 
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